We're going to look at the same data set from Lending Club but ask a different question, one that has a binary outcome.
Let's assume we have a FICO Score of 720 and we want to borrow 10,000 dollars. We would like to get an Interest Rate of less than 12 per cent.
The question we pose is:
How do we use Logistic Regression here? Let's recast the problem as follows:-
Then let us decide that if we get a probability of less than 0.67 we say we won't get the loan, and if it is greater than 0.67 we say we will. That is, we are not confident until we have a 2/3 chance of getting it.
In reality we can set the threshold higher, say 0.8, if we want to be "more certain" that it will happen, but for this exercise we'll just say 0.67.
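The decision rule above can be sketched in a few lines. This is only an illustration: the function name `decide` is hypothetical, and treating a probability of exactly 0.67 as a 'no' is an assumption, since the text leaves that case open.

```python
def decide(probability, threshold=0.67):
    """Return True ('yes') when the probability is strictly above the threshold.

    Assumption: a probability exactly equal to the threshold counts as 'no'.
    """
    return probability > threshold

print(decide(0.75))        # compared against the default 0.67 cutoff
print(decide(0.75, 0.8))   # compared against a stricter 0.8 cutoff
```

Raising the threshold makes the rule more conservative: the same probability can flip from a 'yes' to a 'no'.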
From initial discussion we say we want to start with a model of the form
$Interest Rate = a_0 + a_1*FICOScore + a_2*LoanAmount$
And then derive a second equation of the form:
Z = Prob (InterestRate less than 12 percent).
We apply this to the existing dataset and create a Logistic Regression Model using modeling software.
In [9]:
import pandas as pd
dfr = pd.read_csv('../datasets/loanf.csv')
# inspect, sanity check
dfr.head()
Out[9]:
In [10]:
# we add a column which indicates (True/False) whether the interest rate is <= 12
dfr['TF']=dfr['Interest.Rate']<=12
# inspect again
dfr.head()
# we see that the TF values are False as Interest.Rate is higher than 12 in all these cases
Out[10]:
In [11]:
# now we check the rows that have interest rate == 10 (just some number < 12)
# this is just to confirm that the TF value is True where we expect it to be
d = dfr[dfr['Interest.Rate']==10]
d.head()
# all is well
Out[11]:
Now we use our Logistic Regression modeling software to create a Logit model using this data, with the 'TF' column as the dependent (or response) variable and 'FICO.Score' and 'Loan.Amount' as independent (or predictor) variables.
In [12]:
import statsmodels.api as sm
# statsmodels requires us to add a constant column representing the intercept
dfr['intercept']=1.0
# identify the independent variables
ind_cols=['FICO.Score','Loan.Amount','intercept']
logit = sm.Logit(dfr['TF'], dfr[ind_cols])
result=logit.fit()
We should see some soothing messages from our software reassuring us that all went well and giving us some numbers we may not find useful right now.
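For readers curious about what a fitting step like this does, here is a toy sketch. It fits a one-predictor logistic model by Newton's method on the log-likelihood, using made-up data. It is not the statsmodels implementation; the function names `sigmoid` and `fit_logit_1d` and the data are all hypothetical.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def fit_logit_1d(xs, ys, iterations=25):
    """Fit p = sigmoid(b0 + b1*x) by Newton's method on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # entries of the negated Hessian
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (negated Hessian) * delta = gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up, non-separable toy data: larger x makes y = 1 more likely
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logit_1d(xs, ys)
print(b0, b1)
```

At the fitted values, the predicted probabilities sum to the number of observed 1s, which is one way to check that the search has converged.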
More importantly we want the results.
What are the fitted coefficients that the software has computed?
In [13]:
# get the fitted coefficients from the results
coeff = result.params
print(coeff)
The numbers above are the coefficients for the respective independent (predictor) variables in the linear expression we saw in the Overview, except that we now have two predictors instead of one, plus the intercept. The dtype: float64 is metadata telling us that all these coefficients are of type float64.
So, using the above coefficients, the linear part of our predictor is
$$z = -60.125 + 0.087423*FicoScore - 0.000174*LoanAmount$$
Finally, the probability of our desired outcome, i.e. our getting a loan at 12% interest or less, is
$$p(z) = \frac{1}{1 + e^{-(b_0 + b_1*FicoScore + b_2*LoanAmount)}}$$
where $b_0 = -60.125$, $b_1 = 0.087423$ and $b_2 = -0.000174$.
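As a quick arithmetic check, the two steps can be worked through by hand for FICO = 720 and Loan Amount = 10,000. This sketch uses the rounded coefficients quoted in the text, so its result may differ slightly from what the fitted model gives. The logistic function is written as $1/(1 + e^{-z})$.

```python
from math import exp

# Rounded coefficients quoted in the text (not the full-precision fitted values)
b0, b1, b2 = -60.125, 0.087423, -0.000174

fico, amt = 720, 10000
z = b0 + b1 * fico + b2 * amt      # linear part
p = 1 / (1 + exp(-z))              # logistic function maps z to a probability
print(round(z, 4), round(p, 3))
```

A positive z gives a probability above 0.5, and a negative z gives one below 0.5.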
We create a function in code that encapsulates all this.
It takes as input a borrower's FICO score, the desired loan amount and the coefficient vector from our model. It returns the probability of getting the loan, a number between 0 and 1.
In [14]:
from math import exp

def pz(fico,amt,coeff):
    # compute the linear expression by multiplying the inputs by their respective coefficients.
    # the coefficients are looked up by column name, so their order does not matter
    z = coeff['FICO.Score']*fico + coeff['Loan.Amount']*amt + coeff['intercept']
    # the logistic function turns z into a probability between 0 and 1
    return 1/(1+exp(-1*z))
Now we use our data FICO=720 and Loan Amount=10,000 to get a probability using the z value and the logistic formula.
In [15]:
pz(720,10000,coeff)
Out[15]:
This value of 0.746 tells us we have a good chance of getting the loan we want, according to our criterion, where anything above 0.67 was considered a 'yes'.
Now we are going to try (fico, amt) pairs as follows:
In [16]:
print("Trying multiple FICO Loan Amount combinations: ")
print('----')
print("fico=720, amt=10,000")
print(pz(720,10000,coeff))
print("fico=720, amt=20,000")
print(pz(720,20000,coeff))
print("fico=720, amt=30,000")
print(pz(720,30000,coeff))
print("fico=820, amt=10,000")
print(pz(820,10000,coeff))
print("fico=820, amt=20,000")
print(pz(820,20000,coeff))
print("fico=820, amt=30,000")
print(pz(820,30000,coeff))
We see, as expected, that the person with a 720 FICO Score has a decreasing probability of getting the loan as the amount increases. The person with the 820 FICO Score, however, is very likely to get a loan at each of those amounts, again as expected.
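The same model can also be turned around to ask at what loan amount the probability drops to our 0.67 threshold for a given FICO score. The sketch below solves the linear expression for the amount, using the rounded coefficients from the text. The helper name `break_even_amount` is hypothetical, and the answer may lie outside the range of loan amounts in the data, in which case it is an extrapolation.

```python
from math import log

# Rounded coefficients quoted in the text (not the full-precision fitted values)
b0, b1, b2 = -60.125, 0.087423, -0.000174

def break_even_amount(fico, threshold=0.67):
    """Loan amount at which the modeled probability equals the threshold."""
    target_z = log(threshold / (1 - threshold))   # z value whose probability is the threshold
    # solve b0 + b1*fico + b2*amt = target_z for amt
    return (target_z - b0 - b1 * fico) / b2

for fico in (720, 820):
    print(fico, round(break_even_amount(fico)))
```

Amounts below the break-even value count as a 'yes' under our criterion, and amounts above it count as a 'no'.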
In [17]:
pz(820,63000,coeff)
Out[17]:
Try the following pairs of (fico, amt) values and plug them into the pz() function, mimicking the syntax below. What insight does this give you?
Place your cursor on the cell below. Hit shift-enter to recreate the result.
Then click Insert->Cell Below via the Insert menu dropdown. This creates a new empty cell.
Now enter the pz() function with the next pair of values. Hit shift-enter.
Repeat this till the end of the list of values.
Answer the question above, if possible.
Then explore other pairs as you wish.
In [18]:
pz(820,50000,coeff)
Out[18]:
Use the supporting notebooks in the appendix to learn some plotting techniques and try to create a yes/no plot with loan amount on the x-axis and probability of getting the loan on the y-axis for a FICO score of 720. Do the same for a FICO score of 820.
How would you create a plot that showed the probability of getting a loan as a function of both FICO score and loan amount varying? What tools would you need?
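One possible starting point, before reaching for a plotting library, is to tabulate the probability over a grid of both inputs. The sketch below does this with the rounded coefficients from the text. The grid values and the helper name `prob` are hypothetical choices made only for illustration.

```python
from math import exp

# Rounded coefficients quoted in the text (not the full-precision fitted values)
b0, b1, b2 = -60.125, 0.087423, -0.000174

def prob(fico, amt):
    z = b0 + b1 * fico + b2 * amt
    return 1 / (1 + exp(-z))

ficos = range(660, 841, 40)            # hypothetical grid of FICO scores
amounts = range(5000, 35001, 10000)    # hypothetical grid of loan amounts

# probability for every (fico, amt) combination on the grid
grid = {(f, a): prob(f, a) for f in ficos for a in amounts}

# print the grid as a text table: one row per FICO score, one column per amount
print("FICO " + "".join(f"{a:>9,}" for a in amounts))
for f in ficos:
    print(f"{f:<5}" + "".join(f"{grid[(f, a)]:>9.3f}" for a in amounts))
```

A grid like this is the kind of data that a heatmap or contour plot would render, with FICO score on one axis and loan amount on the other.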
We see for the (720, 10000) case, a probability close to 0.7 which tells us that we have a good chance of getting the loan at a favorable interest rate. Using our threshold of 0.67 we count this as a 'yes'.
Using a Logistic Regression model and a desired Interest Rate of 12 per cent, we used the Lending Club dataset to compute the probability that we will get a 10,000 dollar loan with a FICO Score of 720. Our result, a probability of about 0.75, is above our 0.67 threshold, so we conclude that we would likely be able to procure a loan on these terms.
When we try the multiple combinations we see the following: